Introduction to Open Data Science: final project

Abstract

The idea of this project is to deepen my understanding of one analysis method and to show a practical application of the knowledge gained during the IODS course.

In the following chapter I explore the relationships between the variables in the Hobbies set of the FactoMineR package, using multiple correspondence analysis (MCA).

Description of the research questions

The goal of this project is to explore how hobbies relate to the variables describing individuals (gender, age, profession and marital status) in the dataset. I am particularly interested in whether the data support the following hypotheses:

  • Young people should lean towards hobbies typical for their age, such as computers and sports, while older people should show more interest in gardening or knitting;
  • Gender could be an important factor in the choice of hobbies (fishing and mechanical tinkering being typically male hobbies, and knitting a typically female one).

Description of the original data and the processed data set

As mentioned above, this project uses data from the FactoMineR package.

One of the suggested datasets for this final assignment was the Hobbies set, which contains an extract of the 2003 “Histoire de vie” questionnaire conducted by the French National Institute of Statistics, INSEE. In this part of the study, 8403 individuals aged 15 or older were asked 18 questions about their hobbies. Four categorical variables were used to label the respondents, and a fifth, quantitative variable counts their hobbies:

  • sex (male, female),
  • age (15-25, 26-35, 36-45, 46-55, 56-65, 66-75, 76-85, 86-100),
  • marital status (single, married, widowed, divorced, remarried),
  • profession (manual labourer, unskilled worker, technician, foreman, senior management, employee, other),
  • a quantitative variable indicating the number of hobbies practised out of the 18 possible choices.

The question concerning the hobbies was the following: “Have you done or been involved in the following hobby in the past 12 months, without ever having been obliged to do it?” The dataset included in the FactoMineR package is a data frame with 8403 rows and 23 columns. The rows represent the individuals and the columns the different questions. The first 18 columns are the active hobby questions, the following 4 are supplementary categorical variables (describing the respondents) and the 23rd is a supplementary quantitative variable (the number of activities).

My processed data (in .csv format) and the script used to process the data can be found under these links.

The data wrangling included the following steps:

  • Removing the respondents with missing answers, so that only complete cases are left in the data set.
  • Renaming the levels of the two-level factors, which are either 1 or 0, to ‘yes’ and ‘no’. This way they are more convenient to use as labels.
  • Keeping some variables and leaving out the others. I am interested in most of the variables, but the music-related and TV-related variables were left out (the assumption being that these are fairly common hobbies in all age groups and social classes), and so was the number of activities.

The pre-processed dataset contains 6905 observations of 19 variables.
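
The three wrangling steps above can be sketched in base R on a small toy data frame. This is only an illustration under assumptions: the column names, the toy values and the object names `raw` and `clean` are hypothetical, and the actual processing script may be organised differently.

```r
# Toy stand-in for the raw data (hypothetical values, not the survey data):
# 0/1 answers to hobby questions plus a descriptive and a count column.
raw <- data.frame(
  Reading       = factor(c(1, 0, 1, NA, 1)),
  TV            = factor(c(1, 1, 0, 1, 1)),
  Knitting      = factor(c(0, 0, 1, 1, 0)),
  Sex           = factor(c("F", "M", "F", "M", NA)),
  nb.activities = c(2, 1, 2, 2, 2)
)

# Step 1: keep only respondents without missing answers
clean <- raw[complete.cases(raw), ]

# Step 2: relabel the two-level 0/1 factors as No/Yes
hobby_cols <- c("Reading", "TV", "Knitting")
clean[hobby_cols] <- lapply(clean[hobby_cols], function(x) {
  factor(x, levels = c("0", "1"), labels = c("No", "Yes"))
})

# Step 3: drop the variables that are not used in the analysis
clean <- clean[, setdiff(names(clean), c("TV", "nb.activities"))]

str(clean)
```

In this toy example two of the five rows contain a missing value, so three complete cases remain.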

Exploring the dataset in more detail

In this part I explore the variables of interest in the Hobbies data, first with a numerical summary and then with bar charts.

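library(dplyr)
# Read the pre-processed data; stringsAsFactors = TRUE keeps the answers as factors
# (this is no longer the default from R 4.0.0 onwards)
hobbies <- read.table("C:\\Users\\E130-WIN7\\Documents\\hobbies.csv", sep = ",", header = TRUE, stringsAsFactors = TRUE)
# Drop the row-index column X that was written out with the .csv file
hobbies <- dplyr::select(hobbies, -X)
summary(hobbies)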
##  Reading    Cinema      Show      Exhibition Computer   Sport     
##  No :2265   No :4135   No :4901   No :4746   No :4296   No :4361  
##  Yes:4640   Yes:2770   Yes:2004   Yes:2159   Yes:2609   Yes:2544  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  Walking    Travelling Collecting Volunteering Mechanic   Gardening 
##  No :3378   No :4098   No :6148   No :5820     No :3868   No :4068  
##  Yes:3527   Yes:2807   Yes: 757   Yes:1085     Yes:3037   Yes:2837  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  Knitting   Cooking    Fishing    Sex           Age         Marital.status
##  No :5725   No :3829   No :6122   F:3772   (45,55]:1624   Divorcee : 712  
##  Yes:1180   Yes:3076   Yes: 783   M:3133   (35,45]:1455   Married  :3631  
##                                            (25,35]:1183   Remarried: 352  
##                                            (55,65]:1067   Single   :1609  
##                                            (65,75]: 713   Widower  : 601  
##                                            [15,25]: 456                   
##                                            (Other): 407                   
##             Profession  
##  Employee        :2552  
##  Foreman         : 735  
##  Management      :1052  
##  Manual labourer :1161  
##  Other           : 212  
##  Technician      : 401  
##  Unskilled worker: 792
library(tidyr)
library(ggplot2)
# Keep only the 15 hobby columns and draw one bar chart per hobby
hobbies_general <- dplyr::select(hobbies, -Age, -Marital.status, -Profession, -Sex)
gather(hobbies_general) %>%
  ggplot(aes(value)) + geom_bar(fill = "darkolivegreen") +
  facet_wrap("key", scales = "free") +
  ggtitle("Distribution of hobbies across all respondents") +
  theme(text = element_text(size = 15), axis.text.x = element_text(angle = 45, hjust = 1))

The plots above show how each hobby is distributed across all respondents; some hobbies are clearly more popular than others.

There are some observations I had not predicted. Collecting, fishing, knitting and volunteering are practised by few respondents, whereas for reading the number of people who do it (4640) is much higher than the number who do not (2265).
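
To put numbers on the plots, the share of "Yes" answers per hobby can be computed from the counts printed by summary(hobbies) above. The sketch below simply retypes those counts into a vector (the names `yes_counts`, `n` and `share` are my own); with the data frame loaded, the same shares could be computed from the data directly.

```r
# "Yes" counts copied from the summary(hobbies) output above
yes_counts <- c(
  Reading = 4640, Cinema = 2770, Show = 2004, Exhibition = 2159,
  Computer = 2609, Sport = 2544, Walking = 3527, Travelling = 2807,
  Collecting = 757, Volunteering = 1085, Mechanic = 3037,
  Gardening = 2837, Knitting = 1180, Cooking = 3076, Fishing = 783
)
n <- 6905  # number of respondents in the pre-processed data

# Percentage of respondents practising each hobby, most popular first
share <- round(100 * sort(yes_counts / n, decreasing = TRUE), 1)
print(share)
```

Reading comes first with about 67% of respondents, and collecting last with about 11%.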

library(tidyr)
library(ggplot2)
# Keep only the four variables describing the respondents
individuals <- dplyr::select(hobbies, Sex, Age, Marital.status, Profession)
gather(individuals) %>%
  ggplot(aes(value)) + geom_bar(fill = "darkseagreen") +
  facet_wrap("key", scales = "free") +
  ggtitle("Characteristics of respondents") +
  theme(text = element_text(size = 15), axis.text.x = element_text(angle = 45, hjust = 1))

We can see that the age of the participants is fairly evenly distributed across the middle age groups, with more respondents aged 35 or under (1639) than over 65 (1120). Most of the questionnaire participants are 26-65 years old, while the smallest group is, unsurprisingly, 86-100. The gender distribution is quite even, with slightly more females (3772 against 3133 males). Most of the respondents are married. For profession, the most common category is employee, which means a non-manual type of work.

Description of the methods

Multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space.
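
As a rough illustration of the idea, MCA can be computed as a correspondence analysis of the indicator (dummy) matrix of the data. The base R sketch below does this on a tiny made-up data frame. The data and all object names are hypothetical, and the MCA() function of FactoMineR offers far more (supplementary variables, contributions, v-tests), so this is not a replacement for it.

```r
# Tiny made-up data set with three categorical variables
toy <- data.frame(
  Reading = c("Yes", "Yes", "No", "Yes", "No", "No"),
  Sport   = c("Yes", "No", "No", "Yes", "Yes", "No"),
  Sex     = c("F", "M", "M", "F", "F", "M"),
  stringsAsFactors = TRUE
)

# Indicator matrix: one 0/1 column per category of each variable
Z <- do.call(cbind, lapply(names(toy), function(v) {
  m <- outer(as.character(toy[[v]]), levels(toy[[v]]), "==") * 1
  colnames(m) <- paste(v, levels(toy[[v]]), sep = "_")
  m
}))

K <- ncol(toy)  # number of variables
J <- ncol(Z)    # total number of categories

# Correspondence analysis of Z: SVD of the standardised residuals
P  <- Z / sum(Z)
r  <- rowSums(P)   # row masses
cm <- colSums(P)   # column masses
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
sv <- svd(S)

# At most J - K dimensions carry information
n_dim <- J - K
eig   <- sv$d[seq_len(n_dim)]^2
pct   <- round(100 * eig / sum(eig), 1)

# Coordinates of the individuals (rows) on the retained dimensions
row_coord <- diag(1 / sqrt(r)) %*% sv$u[, seq_len(n_dim)] %*% diag(sv$d[seq_len(n_dim)])

print(eig)
print(pct)
```

The eigenvalues of such an analysis sum to J/K - 1, where J is the total number of categories and K the number of variables, which is why each single dimension of an MCA usually carries only a modest percentage of the total variance.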

Results: visualisations and interpretation

library(FactoMineR)
mca <- MCA(hobbies, graph = FALSE)
# summary of the model
summary(mca)
## 
## Call:
## MCA(X = hobbies, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.183   0.115   0.104   0.076   0.065   0.059
## % of var.             10.537   6.595   5.978   4.372   3.721   3.405
## Cumulative % of var.  10.537  17.132  23.110  27.482  31.203  34.608
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.057   0.055   0.055   0.054   0.053   0.052
## % of var.              3.260   3.184   3.169   3.116   3.070   3.012
## Cumulative % of var.  37.868  41.052  44.221  47.337  50.407  53.419
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.052   0.052   0.050   0.050   0.049   0.046
## % of var.              2.994   2.968   2.903   2.881   2.798   2.664
## Cumulative % of var.  56.413  59.382  62.284  65.166  67.963  70.627
##                       Dim.19  Dim.20  Dim.21  Dim.22  Dim.23  Dim.24
## Variance               0.045   0.043   0.042   0.041   0.038   0.038
## % of var.              2.588   2.469   2.419   2.339   2.196   2.173
## Cumulative % of var.  73.216  75.685  78.104  80.443  82.639  84.812
##                       Dim.25  Dim.26  Dim.27  Dim.28  Dim.29  Dim.30
## Variance               0.036   0.034   0.033   0.032   0.031   0.030
## % of var.              2.092   1.940   1.927   1.831   1.768   1.735
## Cumulative % of var.  86.904  88.845  90.772  92.603  94.371  96.106
##                       Dim.31  Dim.32  Dim.33
## Variance               0.029   0.022   0.017
## % of var.              1.658   1.264   0.972
## Cumulative % of var.  97.763  99.028 100.000
## 
## Individuals (the 10 first)
##                   Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1              |  0.690  0.038  0.276 |  0.187  0.004  0.020 |  0.275
## 2              | -0.001  0.000  0.000 |  0.106  0.001  0.005 | -0.049
## 3              | -0.077  0.000  0.006 |  0.069  0.001  0.005 |  0.003
## 4              | -0.548  0.024  0.282 | -0.445  0.025  0.186 |  0.288
## 5              | -0.173  0.002  0.029 | -0.404  0.021  0.158 | -0.068
## 6              |  0.578  0.026  0.235 | -0.070  0.001  0.003 | -0.422
## 7              |  0.076  0.000  0.005 | -0.009  0.000  0.000 | -0.343
## 8              |  0.062  0.000  0.002 | -0.124  0.002  0.009 | -0.926
## 9              |  0.235  0.004  0.022 |  0.390  0.019  0.061 | -0.057
## 10             |  0.484  0.019  0.197 |  0.147  0.003  0.018 | -0.334
##                   ctr   cos2  
## 1               0.011  0.044 |
## 2               0.000  0.001 |
## 3               0.000  0.000 |
## 4               0.012  0.078 |
## 5               0.001  0.004 |
## 6               0.025  0.125 |
## 7               0.016  0.100 |
## 8               0.120  0.512 |
## 9               0.000  0.001 |
## 10              0.016  0.094 |
## 
## Categories (the 10 first)
##                    Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## Reading_No     |  -0.646   3.937   0.204 -37.501 |  -0.474   3.379   0.109
## Reading_Yes    |   0.315   1.922   0.204  37.501 |   0.231   1.650   0.109
## Cinema_No      |  -0.524   4.726   0.410 -53.181 |   0.056   0.087   0.005
## Cinema_Yes     |   0.782   7.055   0.410  53.181 |  -0.084   0.130   0.005
## Show_No        |  -0.385   3.027   0.363 -50.035 |  -0.075   0.183   0.014
## Show_Yes       |   0.942   7.402   0.363  50.035 |   0.183   0.448   0.014
## Exhibition_No  |  -0.412   3.362   0.374 -50.804 |  -0.128   0.515   0.036
## Exhibition_Yes |   0.907   7.390   0.374  50.804 |   0.281   1.132   0.036
## Computer_No    |  -0.465   3.874   0.357 -49.614 |   0.177   0.891   0.051
## Computer_Yes   |   0.766   6.379   0.357  49.614 |  -0.291   1.467   0.051
##                 v.test     Dim.3     ctr    cos2  v.test  
## Reading_No     -27.489 |  -0.013   0.003   0.000  -0.763 |
## Reading_Yes     27.489 |   0.006   0.001   0.000   0.763 |
## Cinema_No        5.703 |   0.241   1.763   0.087  24.464 |
## Cinema_Yes      -5.703 |  -0.360   2.632   0.087 -24.464 |
## Show_No         -9.742 |   0.040   0.058   0.004   5.218 |
## Show_Yes         9.742 |  -0.098   0.142   0.004  -5.218 |
## Exhibition_No  -15.732 |  -0.083   0.240   0.015 -10.214 |
## Exhibition_Yes  15.732 |   0.182   0.526   0.015  10.214 |
## Computer_No     18.826 |   0.160   0.809   0.042  17.075 |
## Computer_Yes   -18.826 |  -0.264   1.332   0.042 -17.075 |
## 
## Categorical variables (eta2)
##                  Dim.1 Dim.2 Dim.3  
## Reading        | 0.204 0.109 0.000 |
## Cinema         | 0.410 0.005 0.087 |
## Show           | 0.363 0.014 0.004 |
## Exhibition     | 0.374 0.036 0.015 |
## Computer       | 0.357 0.051 0.042 |
## Sport          | 0.321 0.028 0.018 |
## Walking        | 0.161 0.029 0.054 |
## Travelling     | 0.357 0.010 0.018 |
## Collecting     | 0.028 0.000 0.012 |
## Volunteering   | 0.101 0.011 0.046 |

The eigenvalues show that the first dimension retains about 10.5% of the total variance and the second one about 6.6%, so the first two dimensions together account for roughly 17%.
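
The percentages in the eigenvalue table can be checked by hand. In an MCA with all variables active, the total variance equals J/K - 1, where J is the total number of categories and K the number of variables. The sketch below uses the category counts listed earlier (15 yes/no hobbies, 2 sexes, 8 age groups, 5 marital statuses, 7 professions) and the rounded eigenvalues from the summary; the object names are my own.

```r
# Total number of categories and number of variables in the model
n_categories <- 15 * 2 + 2 + 8 + 5 + 7  # 52
n_variables  <- 19

# Number of dimensions and total variance (inertia) of the MCA
n_dimensions  <- n_categories - n_variables      # 33, as in the summary
total_inertia <- n_categories / n_variables - 1  # about 1.737

# Rounded eigenvalues of the first two dimensions, copied from the summary
eig <- c(Dim.1 = 0.183, Dim.2 = 0.115)
pct <- round(100 * eig / total_inertia, 1)
print(pct)
# Dim.1 Dim.2
#  10.5   6.6
```

The result agrees with the "% of var." row of the summary (10.537 and 6.595) up to rounding of the eigenvalues.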

None of the squared correlation ratios (eta2) between the variables and the dimensions is close to 1, meaning that no single variable is strongly tied to one dimension. Among the variables listed in the summary, cinema (0.41), exhibition (0.37), show (0.36), computer and travelling (both 0.36) and sport (0.32) are the most strongly related to dimension 1, while knitting and fishing seem to be the most strongly related to dimension 2.
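
The eta2 values in the summary are squared correlation ratios between a variable and the coordinates of the individuals on a dimension, that is, the share of the variance of the coordinates that is explained by the categories of the variable. A minimal base R sketch on made-up coordinates follows; the function name `eta2` and the toy values are my own assumptions, not FactoMineR code.

```r
# Squared correlation ratio between numeric coordinates and a grouping variable:
# between-group sum of squares divided by total sum of squares
eta2 <- function(coord, group) {
  group <- as.factor(group)
  grand <- mean(coord)
  ss_total   <- sum((coord - grand)^2)
  means      <- tapply(coord, group, mean)
  sizes      <- tapply(coord, group, length)
  ss_between <- sum(sizes * (means - grand)^2)
  ss_between / ss_total
}

# Made-up coordinates of six individuals on one dimension
coord <- c(-1.0, -0.8, -0.9, 0.7, 1.1, 0.9)

# A variable that separates the individuals well along the dimension ...
cinema <- c("No", "No", "No", "Yes", "Yes", "Yes")
# ... and one that hardly separates them at all
collecting <- c("No", "Yes", "No", "Yes", "No", "Yes")

eta2_cinema     <- eta2(coord, cinema)
eta2_collecting <- eta2(coord, collecting)
print(c(cinema = eta2_cinema, collecting = eta2_collecting))
```

A value near 1 means that the categories of the variable are clearly separated along the dimension; a value near 0 means that they overlap almost completely. For a variable with only two categories, eta2 coincides with the cos2 of each of its categories, as can be seen for Reading (0.204) in the summary.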

The following plots illustrate the dominant dimensions of the data and a more detailed view of the variables which are related to the two dimensions.

library("factoextra")
res.mca <- MCA(hobbies, graph = FALSE)
eig.val <- get_eigenvalue(res.mca)
 # head(eig.val)
fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))

fviz_mca_var(res.mca, choice = "mca.cor", 
            repel = TRUE, 
            ggtheme = theme_minimal())

The following plot is the factor map.

Here, most of the categories are concentrated in the middle. Categories such as being a widower, a preference for knitting, and old age lie far away from the centre and close to dimension 1. Being a technician or a manual labourer, fishing, and being a young male are categories contributing to dimension 2.

# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

Each hobby

Here, each hobby is shown in a separate plot, with pink meaning yes and blue meaning no.

plotellipses(mca, keepvar = 1:5, means = FALSE, label = "quali")

plotellipses(mca, keepvar = 6:10, means = FALSE, label = "quali")

plotellipses(mca, keepvar = 11:15, means = FALSE, label = "quali")

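Not having a hobby is connected with the left-hand side of the plots, while having a hobby is related to the right-hand side. This matches the summary of the model, where the listed "No" categories have negative coordinates on dimension 1 and the "Yes" categories positive ones.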

plotellipses(mca, keepvar = 16, means = FALSE, label = "quali")

plotellipses(mca, keepvar = 17, means = FALSE, label = "quali")

plotellipses(mca, keepvar = 18, means = FALSE, label = "quali")

plotellipses(mca, keepvar = 19, means = FALSE, label = "quali")

Conclusions and discussion
